Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Speechreading, or lipreading, is the technique of understanding speech and extracting
phonetic features from a speaker's visual cues, such as the movements of the lips,
face, teeth, and tongue. It has a wide range of multimedia applications, such as
in surveillance, Internet telephony, and as an aid to a person with hearing
impairments. However, most of the work in speechreading has been limited to
text generation from silent videos. Recently, research has started venturing
into generating (audio) speech from silent video sequences, but there have been
no developments thus far in dealing with divergent views and poses of a
speaker. Thus, although multiple camera feeds of a speaker may be available,
they have not been used to handle these different poses. To this end, this
paper presents the first multi-view speechreading and reconstruction system.
This work pushes the boundaries of multimedia research by putting forth a model
that leverages silent video feeds from multiple cameras recording the same
subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of
exploiting multiple camera views in building an efficient speech reading and
reconstruction system. It further shows the optimal placement of cameras that
would lead to maximum intelligibility of speech. Finally, it lays out various
innovative applications of the proposed system, focusing on its potentially
prodigious impact not just in the security arena but in many other multimedia
analytics problems.
Comment: 2018 ACM Multimedia Conference (MM '18), October 22--26, 2018, Seoul, Republic of Korea
The role of aural frequency analysis in pitch perception with simultaneous complex tones
Pitch perception has long been an important issue in the psychoacoustic literature. In particular, the problem of complex-tone pitch, which does not simply depend on any single spectral frequency, has been the object of much interest during the past century. Since Seebeck (1841) discovered that upper partials contribute significantly to the pitch of complex tones, several mechanisms have been proposed, such as nonlinear distortion creating a difference tone (Helmholtz, 1863; Fletcher, 1924), interference between unresolved partials causing a periodic envelope pattern (Schouten, 1940; Plomp, 1967), or some form of central neural processing (Goldstein, 1973; Wightman, 1973; Terhardt, 1972). Most modern pitch theories agree that the pitch of a complex tone is directly or indirectly derived from spectral frequencies which are resolved in the cochlea.
Quantifying sound quality in loudspeaker reproduction
We present PREQUEL: Perceptual Reproduction Quality Evaluation for Loudspeakers. Instead of quantifying the loudspeaker system itself, PREQUEL quantifies the loudspeakers' overall perceived sound quality by assessing their acoustic output using a set of music signals. This approach introduces a major problem: subjects cannot be provided with an acoustic reference signal, and their judgment is based on an unknown, internal reference. However, an objective perceptual assessment algorithm needs a reference signal in order to be able to predict the perceived sound quality. In this paper, these reference signals are created by making binaural recordings with a head and torso simulator, using the best quality loudspeakers, in the ideal listening spot in the best quality listening environment. The reproduced reference signal with the highest subjective quality is compared to the acoustically degraded loudspeaker output. PREQUEL is developed and, subsequently, validated using three databases that contain binaurally recorded music fragments played over low to high quality loudspeakers in low to high quality listening rooms. The model shows a high average correlation (0.85) between objective and subjective measurements. PREQUEL thus allows prediction of the subjectively perceived sound quality of loudspeakers, taking into account the influence of the listening room and the listening position.
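The reported 0.85 figure is a Pearson correlation between the model's objective predictions and subjective listening scores across music fragments. As a generic illustration of how such a validation number is computed (the scores below are hypothetical, not data from the PREQUEL study), in pure Python:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-fragment scores: objective model output vs. subjective rating.
objective = [2.1, 3.4, 4.0, 2.8, 3.9]
subjective = [2.3, 3.1, 4.2, 2.6, 3.8]
r = pearson(objective, subjective)
```

A correlation near 1 indicates the objective model ranks and scales the fragments close to how listeners do; 0.85 averaged over three databases is a strong result for a no-external-reference setting.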
Enhancing the Quality of Service of mobile video technology by increasing multimodal synergy
Bandwidth is still a limiting factor for the Quality of Service (QoS) of mobile communication applications. In particular, for Voice over IP the QoS is not yet as good as for common, well-engineered, public-switched telephone networks. Multisensory communication has been identified as a possibility to moderate this limitation. One of the strengths of mobile video technology lies in its combination of visual and auditory modalities. However, one of the most salient features of mobile video applications is its small screen size. To test the potential of multimodal synergy for mobile devices, we assessed to what extent small screens affect multimodal synergy. This potential was assessed in an experiment with 54 participants, who conducted a standardised video-listening test for three talking-heads videos with a signal-to-noise ratio of –9 dB. The videos were presented on three different screen sizes, whilst keeping the video and auditory signals equal. Compared to a ground truth based on 359 participants, intelligibility was found to be significantly higher when using a large screen than when using a small screen. This indicates that mobile video technology has the potential for a significant multimodal synergy, to which screen size is a substantial constraint. To optimally benefit from their multimodal potential, we offer suggestions on how to increase the effective screen size for small-screen (e.g. mobile) devices and applications through elaborating the most relevant (visual) features. We conclude that knowledge about human sensory processing can alleviate the identified constraint and maximise the potential QoS of mobile video technology.
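A signal-to-noise ratio of –9 dB means the noise power is roughly eight times the speech power, a condition in which visual cues contribute substantially to intelligibility. As a minimal sketch of how such a stimulus condition can be produced (illustrative only; this is not the study's actual stimulus pipeline, and the tone/noise signals are placeholders), scale a noise signal so the mixture hits a target SNR:

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech/noise power ratio equals snr_db, then mix."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    target_p_noise = p_speech / (10 ** (snr_db / 10.0))
    gain = math.sqrt(target_p_noise / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]

random.seed(0)
# Placeholder "speech": a 440 Hz tone at 8 kHz sampling; placeholder noise: white Gaussian.
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [random.gauss(0.0, 1.0) for _ in range(8000)]
mixture = mix_at_snr(speech, noise, -9.0)  # noise power ~ 8x speech power
```

Keeping the auditory signal fixed at this SNR across all three screen sizes is what isolates screen size as the variable behind the intelligibility differences.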
Perceptual Evaluation of Speech Quality (PESQ) -- A New Method for Speech Quality Assessment of Telephone Networks and Codecs
Previous objective speech quality assessment models, such as bark spectral distortion (BSD), the perceptual speech quality measure (PSQM), and measuring normalizing blocks (MNB), have been found to be suitable for assessing only a limited range of distortions. A new model has therefore been developed for use across a wider range of network conditions, including analogue connections, codecs, packet loss, and variable delay.